Efficient Fault Tolerance: An Approach to Deal with Transient Faults in Multiprocessor Architectures
نویسندگان
چکیده
Dynamic error processing approaches are an important mechanism to increase the reliability in a multiprocessor system, while making efficient use of the available resources. To this end, dynamic error processing must be integrated with a fault treatment approach aiming at optimising resource utilisation. In this paper we propose a diagnosis approach that, accounting for transient faults, tries to remove units very cautiously and to balance between two conflicting requirements. The first is to avoid the removal of units that have experienced transient faults and can be still useful for the system and the other is to avoid to keep failed units whose usage may lead to a premature failure of the system. The proposed fault treatment approach is integrated with a mechanism for dynamic error processing in a complete fault tolerance strategy. Reliability analyses based on the Markov approach and an efficiency evaluation performed by simulation are carried out.
منابع مشابه
Reliability and Performance Evaluation of Fault-aware Routing Methods for Network-on-Chip Architectures (RESEARCH NOTE)
Nowadays, faults and failures are increasing especially in complex systems such as Network-on-Chip (NoC) based Systems-on-a-Chip due to the increasing susceptibility and decreasing feature sizes. On the other hand, fault-tolerant routing algorithms have an evident effect on tolerating permanent faults and improving the reliability of a Network-on-Chip based system. This paper presents reliabili...
متن کاملTransient Processor/Bus Fault Tolerance for Embedded Systems
We propose an approach to build fault-tolerant distributed real-time embedded systems. From a given system description (application algorithm and architecture) and a given fault hypothesis (type and number of faults to be tolerated), we generate automatically a static fault-tolerant multiprocessor schedule of the algorithm components on the target architecture, which minimizes the schedule leng...
متن کاملAdaptable Fault Tolerance Configurations for Multiprocessor Systems
The escalating increase in the complexity of multiprocessor systems increases the probability of faults occurring in these systems As a consequence there is a great need for achieving fault-tolerance of processing in multiprocessor systems. Faulttolerance generally requires some forms of hardware and/or time redundancy. Two fault tolerant configurations are proposed for both single and double t...
متن کاملDesigning Fault-Tolerant System Using Automorphisms
This paper presents a general theory for modeling and designing fault-tolerant multiprocessor systems in a systematic and efficient manner. We are concerned here with structural fault tolerance, defined as the ability to reconfigure around faults in order to preserve the interconnection structure of a multiprocessor. We represent multiprocessor systems by graphs whose node sets denote processor...
متن کاملThunderGeckoMonkey: An Energy-Aware High Performance Secure Computing System
This paper presents ThunderGeckoMonkey, a high-performance, energy-efficient processor aimed at On-Line Transaction Processing (OLTP) workloads. ThunderGeckoMonkey also offers enhanced, transparent fault tolerance and hardware support for a wide range of common encryption and hashing algorithms. A chip multiprocessor, ThunderGeckoMonkey is based on 4issue out-of-order cores, each utilizing adva...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 1994